Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap

Authors

  • Gavin A. Price
  • Gavin E. Crooks
  • Richard E. Green
  • Steven E. Brenner
Abstract

MOTIVATION: Protein sequence comparison methods are routinely used to infer the intricate network of evolutionary relationships found within the rapidly growing library of protein sequences, and thereby to predict the structure and function of uncharacterized proteins. In the present study, we detail an improved statistical benchmark of pairwise protein sequence comparison algorithms. We use bootstrap resampling techniques to determine standard statistical errors and to estimate the confidence of our conclusions. We show that the underlying structure within benchmark databases causes Efron's standard, non-parametric bootstrap to be biased. Consequently, the standard bootstrap underpredicts average performance when used in the context of evaluating sequence comparison methods. We have developed, as an alternative, an unbiased statistical evaluation based on the Bayesian bootstrap, a resampling method operationally similar to the standard bootstrap.

RESULTS: We apply our analysis to the comparative study of amino acid substitution matrix families and find that using modern matrices results in a small, but statistically significant improvement in remote homology detection compared with the classic PAM and BLOSUM matrices.

AVAILABILITY: The sequence sets and code for performing these analyses are available from http://compbio.berkeley.edu/.

CONTACT: [email protected].
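The abstract describes the Bayesian bootstrap as operationally similar to the standard bootstrap. The sketch below illustrates only that generic operational difference and is not the authors' benchmark code: the standard bootstrap resamples items with replacement (integer counts), whereas the Bayesian bootstrap, in its usual textbook form, reweights every item with continuous weights drawn from a flat Dirichlet distribution. The `scores` list and all function names are hypothetical, and the statistic (a simple mean) is a stand-in for whatever performance measure is being evaluated.

```python
import random


def bayesian_bootstrap_weights(n, rng):
    """Draw one weight vector from a flat Dirichlet(1, ..., 1) distribution."""
    # Normalised Gamma(1, 1) draws follow a Dirichlet(1, ..., 1) distribution.
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def bayesian_bootstrap_means(scores, replicates=1000, seed=0):
    """Weighted means of `scores`, one per Bayesian bootstrap replicate."""
    rng = random.Random(seed)
    means = []
    for _ in range(replicates):
        weights = bayesian_bootstrap_weights(len(scores), rng)
        means.append(sum(w * s for w, s in zip(weights, scores)))
    return means


def standard_bootstrap_means(scores, replicates=1000, seed=0):
    """Means of resampled-with-replacement copies of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    return [sum(rng.choices(scores, k=n)) / n for _ in range(replicates)]


def mean_and_sd(values):
    """Sample mean and standard deviation of a list of replicate values."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, var ** 0.5


# Hypothetical per-item performance scores, for illustration only.
scores = [0.12, 0.35, 0.40, 0.58, 0.61, 0.77]

bb_means = bayesian_bootstrap_means(scores)
sb_means = standard_bootstrap_means(scores)

print("Bayesian bootstrap (mean, sd):", mean_and_sd(bb_means))
print("Standard bootstrap (mean, sd):", mean_and_sd(sb_means))
```

In the Bayesian variant every item keeps a non-zero weight in every replicate, while the standard bootstrap drops some items entirely from each resample. For a plain mean over unstructured data, as here, both procedures behave similarly; the bias discussed in the abstract concerns structured benchmark databases, which this toy example does not model.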

Similar resources

Cumulative Errata: A memorandum of minor miscellaneous mistakes.

Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap. from nonequilibrium work data in the presence of instrument noise. modynamic metrics and optimal paths.

gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...

Hyperbolic Cosine Log-Logistic Distribution and Estimation of Its Parameters by Using Maximum Likelihood Bayesian and Bootstrap Methods

In this paper, a new probability distribution, based on the family of hyperbolic cosine distributions is proposed and its various statistical and reliability characteristics are investigated. The new category of HCF distributions is obtained by combining a baseline F distribution with the hyperbolic cosine function. Based on the base log-logistic distribution, we introduce a new di...

Molecular Identification of the Persian Gulf Sea Hare (Aplysia sp.) Based on 16s rRNA Gene Sequence

Background: Sea hares of the Aplysia genus are among the mollusks of interest for various researchers to study their phylogeny, bioactive compounds and the nervous system. These mollusks are herbivorous and produce chemical compounds (ink) to defend themselves. The present study provided molecular identification of the Persian Gulf (Bushehr city) sea hare using 16s rRNA gene sequence. Materials...

Improving the Performance of Bayesian Estimation Methods in Estimations of Shift Point and Comparison with MLE Approach

A Bayesian analysis is used to detect a change-point in a sequence of independent random variables from exponential distributions. In this paper, we try to estimate the change point which occurs in any sequence of independent exponential observations. The Bayes estimators are derived for change point, the rate of exponential distribution before shift and the rate of exponential distribution after s...

Journal:
  • Bioinformatics

Volume 21, Issue 20

Pages: -

Publication date: 2005